Univariate Plots Section

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The dataset has 1599 observations of 12 variables.

Plot a histogram of every variable.

Histogram of quality

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Most wines (82%) have a quality score of 5 or 6.
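The 82% figure can be checked directly from the counts in the table above. This is a minimal sketch in base R, with the counts typed in by hand rather than read from the data; the name `quality_counts` is made up for this example.

```r
# Counts per quality score, copied from the table() output above
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Share of each score, then the combined share of scores 5 and 6
quality_share <- quality_counts / sum(quality_counts)
share_5_6 <- sum(quality_share[c("5", "6")])
round(100 * share_5_6, 1)
```

On the full data frame the same shares should come from `prop.table(table(wine$quality))`.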

Histogram of fixed.acidity

## 
##  7.2  7.1  7.8  7.5    7  7.7  6.8  7.6  8.2  7.3  7.4  7.9    8  8.3  6.9 
##   67   57   53   52   50   49   46   46   45   44   44   42   42   40   38 
##  6.6  8.8  8.9  9.1  6.7  8.6  8.1  8.4    9  9.9  6.4  8.7   10  9.3 10.4 
##   37   34   33   29   28   27   26   26   26   26   25   24   23   22   21 
##  6.2  8.5 10.2  6.5  9.4  9.6  6.1  9.2  9.8  5.6  6.3  9.5 10.6    6 11.5 
##   20   19   19   17   17   17   16   16   15   14   14   14   14   13   13 
## 10.5 11.6 11.9 10.3 10.1 10.7 10.8  5.9  9.7 11.1 10.9 11.3   12 12.5    5 
##   12   12   12   11   10   10   10    9    9    9    8    7    7    7    6 
##  5.2  5.4 11.2 11.4 12.3 12.8  5.1  5.3  5.8 12.2 12.4 12.6 12.7   11 11.7 
##    6    5    5    5    5    5    4    4    4    4    4    4    4    3    3 
## 11.8   13 13.2 13.3  5.7 12.9 13.7   15 15.5 15.6  4.6  4.7  4.9  5.5 12.1 
##    3    3    3    3    2    2    2    2    2    2    1    1    1    1    1 
## 13.4 13.5 13.8   14 14.3 15.9 
##    1    1    1    1    1    1

fixed.acidity peaks at 7.2, with a few outliers near 16.

Histogram of volatile.acidity

## 
##   0.6   0.5  0.43  0.59  0.36  0.58   0.4  0.38  0.39  0.49  0.56  0.41 
##    47    46    43    39    38    38    37    35    35    35    34    33 
##  0.52  0.42  0.46  0.54  0.31  0.34  0.53  0.63  0.57  0.61  0.64  0.66 
##    33    31    31    31    30    30    29    29    28    27    27    26 
##  0.37  0.48  0.51  0.62  0.28  0.32  0.44  0.67  0.69  0.35  0.45  0.47 
##    24    24    24    24    23    23    23    23    23    22    22    21 
##  0.33  0.55  0.26  0.29   0.3  0.65  0.27  0.24 0.645  0.68 0.715 0.685 
##    20    20    16    16    16    16    14    13    12    12    12    11 
##  0.74  0.18   0.7  0.78 0.635 0.725 0.735 0.785  0.84  0.25 0.655 0.695 
##    11    10    10    10     9     9     8     8     8     7     7     7 
##  0.21  0.22 0.615 0.705  0.73  0.75  0.77  0.23 0.545  0.72 0.745  0.76 
##     6     6     6     6     6     6     6     5     5     5     5     5 
## 0.765  0.82  0.88 0.885 0.775  0.83 0.835  0.87 0.915  1.02  0.12   0.2 
##     5     5     5     5     4     4     4     4     4     4     3     3 
## 0.415 0.575 0.585 0.605 0.625 0.665 0.675  0.71 0.755   0.8 0.815 0.855 
##     3     3     3     3     3     3     3     3     3     3     3     3 
##   0.9  0.91  0.96 0.965  0.98     1  1.04  0.16  0.19 0.305 0.315 0.365 
##     3     3     3     3     3     3     3     2     2     2     2     2 
## 0.395 0.475  0.79 0.795  0.81  0.85  0.86 0.875 0.935  1.33 0.295 0.565 
##     2     2     2     2     2     2     2     2     2     2     1     1 
## 0.595 0.805 0.825 0.845 0.865  0.89 0.895  0.92  0.95 0.955 0.975 1.005 
##     1     1     1     1     1     1     1     1     1     1     1     1 
##  1.01 1.025 1.035  1.07  1.09 1.115  1.13  1.18 1.185  1.24  1.58 
##     1     1     1     1     1     1     1     1     1     1     1

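volatile.acidity peaks at 0.6, with an outlier near 1.6.

Remove 1% of the values as outliers and plot the histogram again.

The result is an approximately symmetric, bimodal histogram.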
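One way to drop the most extreme 1% is to cut at the 99th percentile. The sketch below assumes the outliers are taken from the upper tail only, which the text does not state. The vector `x` is made-up stand-in data, not the real volatile.acidity column.

```r
# Stand-in data: 99 typical values plus one extreme value (illustrative only)
x <- c(seq(0.2, 0.9, length.out = 99), 1.58)

# Keep everything at or below the 99th percentile
cutoff <- quantile(x, 0.99)
x_trimmed <- x[x <= cutoff]

length(x) - length(x_trimmed)  # number of values dropped
max(x_trimmed)
```

Applied to the data frame, the equivalent filter would be `subset(wine, volatile.acidity <= quantile(volatile.acidity, 0.99))`.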

Histogram of citric.acid

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

There are 132 zero values and one outlier at 1. The distribution is multimodal.

Histogram of residual.sugar

## 
##    2  2.2  1.8  2.1  1.9  2.3  2.4  2.5  2.6  1.7  1.6  2.8  2.7  1.4  1.5 
##  156  131  129  128  117  109   86   84   79   76   58   49   39   35   30 
##    3  2.9  3.2  3.4  3.3    4  1.2  3.6  3.8  4.3  5.5  3.1  3.9  4.1  4.6 
##   25   24   15   15   11   11    8    8    8    8    8    7    6    6    6 
##  5.6  1.3  4.2  5.1  3.7  4.4  4.5  5.8    6  6.1  4.8  5.2  5.9  6.2  6.4 
##    6    5    5    5    4    4    4    4    4    4    3    3    3    3    3 
##  7.9  8.3  0.9 1.65 1.75 2.05 2.15  3.5 4.65  6.3 6.55  6.6  6.7  7.8  8.1 
##    3    3    2    2    2    2    2    2    2    2    2    2    2    2    2 
##  8.8   11 13.8 15.4 2.25 2.35 2.55 2.65 2.85 2.95 3.45 3.65 3.75 4.25  4.7 
##    2    2    2    2    1    1    1    1    1    1    1    1    1    1    1 
##    5 5.15  5.4  5.7    7  7.2  7.3  7.5  8.6  8.9    9 10.7 12.9 13.4 13.9 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 15.5 
##    1

The peak is at 2, with a long right tail.

Apply a log transform to residual.sugar and plot the histogram again.
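The effect of the log transform on a long right tail can be illustrated with a simple skewness measure. This is a sketch on a handful of residual.sugar values copied from the output above, not on the full column; the helper `skewness` is defined here and is not part of base R.

```r
# A few residual.sugar values taken from the output above (not the full column)
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 15.5)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(sugar)
skew_log <- skewness(log10(sugar))
c(raw = skew_raw, log10 = skew_log)
```

A smaller positive skewness after `log10()` means the tail has been pulled in, which is why the transformed histogram looks closer to symmetric.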

Histogram of chlorides

## 
##  0.08 0.074 0.076 0.078 0.084 0.071 0.077 0.082 0.075 0.079 0.081  0.07 
##    66    55    51    51    49    47    47    46    45    43    40    35 
## 0.073 0.083 0.066 0.088 0.086 0.068 0.067 0.085 0.087 0.089 0.062 0.072 
##    35    35    32    32    31    30    27    25    25    25    24    24 
## 0.065 0.095 0.063 0.092 0.069  0.09 0.093 0.064 0.091 0.094 0.096 0.097 
##    23    23    22    22    21    21    21    20    19    19    18    18 
## 0.059  0.06 0.104 0.058 0.054   0.1  0.05 0.098 0.061 0.114 0.052 0.057 
##    17    16    16    14    13    13    12    12    11    11    10    10 
## 0.102 0.056 0.107 0.048 0.049 0.055 0.099 0.106  0.11 0.118 0.103 0.111 
##    10     9     9     8     8     8     8     8     8     8     7     7 
## 0.122 0.105 0.112 0.123 0.044 0.053 0.101 0.115 0.039 0.041 0.045 0.046 
##     7     6     6     6     5     5     5     5     4     4     4     4 
## 0.047 0.117 0.132 0.042 0.109 0.119  0.12 0.124 0.157 0.166 0.214 0.415 
##     4     4     4     3     3     3     3     3     3     3     3     3 
## 0.012 0.038 0.116 0.121 0.152 0.171 0.178 0.205 0.226 0.414 0.034 0.043 
##     2     2     2     2     2     2     2     2     2     2     1     1 
## 0.051 0.108 0.113 0.125 0.126 0.127 0.128 0.136 0.137 0.143 0.145 0.146 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.147 0.148 0.153 0.159 0.161 0.165 0.168 0.169  0.17 0.172 0.174 0.176 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.186  0.19 0.194   0.2 0.213 0.216 0.222  0.23 0.235 0.236 0.241 0.243 
##     1     1     1     1     1     1     1     1     1     1     1     1 
##  0.25 0.263 0.267  0.27 0.332 0.337 0.341 0.343 0.358  0.36 0.368 0.369 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.387 0.401 0.403 0.413 0.422 0.464 0.467  0.61 0.611 
##     1     1     1     1     1     1     1     1     1

The peak is at 0.08, with a long right tail.

Apply a log transform to chlorides and plot the histogram again.

Histogram of free.sulfur.dioxide

## 
##   6   5  10  15  12   7 
## 138 104  79  78  75  71

free.sulfur.dioxide peaks at 6, with a long tail and some outliers.

Histogram of total.sulfur.dioxide

## 
## 28 24 15 18 23 14 
## 43 36 35 35 34 33

total.sulfur.dioxide peaks at 28, with a long tail and some outliers. Its distribution resembles that of free.sulfur.dioxide, so I expect the two variables to be correlated.

Histogram of density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Approximately normal, with median 0.9968 and mean 0.9967.

Histogram of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

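Approximately normal, with median 3.310 and mean 3.311.

Histogram of sulphates

The distribution has a long tail and some outliers. A log transform makes it approximately normal, with the peak near 0.6.

Histogram of alcohol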

## 
##  9.5  9.4  9.8  9.2   10 10.5 
##  139  103   78   72   67   67

The peak is at 9.5, and the shape of this histogram resembles those of total.sulfur.dioxide and free.sulfur.dioxide.

Univariate Analysis

What is the structure of your dataset?

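The dataset has 1599 observations of 12 variables. quality is the rating variable and is treated as a categorical factor; the scores observed in this sample run from 3 to 8.
1. citric.acid contains a large number of zeros.
2. density and pH are approximately normally distributed.
3. residual.sugar, chlorides and sulphates have long right tails.
4. Most wines (82%) have a quality score of 5 or 6.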

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality; the goal is to find out which variables influence it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

quality is most strongly correlated with alcohol, volatile acidity, sulphates and citric acid.

# quality/alcohol boxplot
qplot(x = quality, y = alcohol, data = wine, geom = 'boxplot')

# Summary of alcohol by quality
by(wine$alcohol, wine$quality, summary)
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Higher-quality wines tend to have higher alcohol. Apart from quality 5, the median alcohol rises with quality, and the quality-5 group has many outliers. I suspect these may be errors in the sample.
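The break in the trend at quality 5 can be read off the medians with `diff()`. This is a small sketch using the medians typed in from the `by()` output above; the vector name `median_alcohol` is made up for this example.

```r
# Median alcohol per quality score, copied from the by() output above
median_alcohol <- c(`3` = 9.925, `4` = 10.00, `5` = 9.7,
                    `6` = 10.50, `7` = 11.50, `8` = 12.15)

# Change in the median from one score to the next
step <- diff(median_alcohol)
step

# Scores whose median is lower than the previous score's median
names(step)[step < 0]
```

On the full data the medians themselves should be obtainable with `tapply(wine$alcohol, wine$quality, median)`.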

# quality/volatile.acidity boxplot
qplot(x = quality, y = volatile.acidity, data = wine, geom = 'boxplot')

# Summary of volatile.acidity by quality
by(wine$volatile.acidity, wine$quality, summary)
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

volatile.acidity is negatively correlated with quality. The median volatile.acidity falls as quality rises, although it does not change between quality 7 and 8 (both 0.37). Overall, good wines have lower volatile.acidity.

# quality/sulphates boxplot
qplot(x = quality, y = sulphates, data = wine, geom = 'boxplot')

# Summary of sulphates by quality
by(wine$sulphates, wine$quality, summary)
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

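sulphates rises with quality. However, the quality 5 and 6 groups contain many outliers, possibly because of errors in the sample, so we cannot conclude that sulphates and quality are correlated; we can only say that sulphates may affect the taste of the wine.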

# quality/citric.acid boxplot
qplot(x = quality, y = citric.acid, data = wine, geom = 'boxplot')

# Summary of citric.acid by quality
by(wine$citric.acid, wine$quality, summary)
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

citric.acid also rises with quality; the two are positively correlated. Interestingly, the medians are close within each pair of adjacent scores: 3 and 4, 5 and 6, 7 and 8.

# boxplot for the others
qplot(x = quality, y = fixed.acidity, data = wine, geom = 'boxplot') +
  ylim(quantile(wine$fixed.acidity, 0.05), quantile(wine$fixed.acidity, 0.95))
## Warning: Removed 149 rows containing non-finite values (stat_boxplot).

qplot(x = quality, y = residual.sugar, data = wine, geom = 'boxplot') +
  ylim(0, quantile(wine$residual.sugar, 0.95))
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).

qplot(x = quality, y = chlorides, data = wine, geom = 'boxplot') +
  ylim(quantile(wine$chlorides, 0.05), quantile(wine$chlorides, 0.95))
## Warning: Removed 171 rows containing non-finite values (stat_boxplot).

qplot(x = quality, y = free.sulfur.dioxide, data = wine, geom = 'boxplot') +
  ylim(0, quantile(wine$free.sulfur.dioxide, 0.95))
## Warning: Removed 77 rows containing non-finite values (stat_boxplot).

qplot(x = quality, y = total.sulfur.dioxide, data = wine, geom = 'boxplot') +
  ylim(0, quantile(wine$total.sulfur.dioxide, 0.95))
## Warning: Removed 80 rows containing non-finite values (stat_boxplot).

# ylim() must be chained with `+`; on its own line it only prints the scale
qplot(x = quality, y = density, data = wine, geom = 'boxplot') +
  ylim(quantile(wine$density, 0.05), quantile(wine$density, 0.95))

qplot(x = quality, y = pH, data = wine, geom = 'boxplot') +
  ylim(quantile(wine$pH, 0.05), quantile(wine$pH, 0.95))

density, pH and fixed.acidity are also related to quality: higher-quality wines tend to have higher fixed.acidity and lower density and pH.

The correlation matrix shows that several of the non-quality variables are also correlated with each other:
1. Fixed acidity vs citric acid (0.67)
2. Volatile acidity vs citric acid (-0.55)
3. Fixed acidity vs density (0.67)
4. Fixed acidity vs pH (-0.68)
5. Citric acid vs pH (-0.54)
6. Free sulfur dioxide vs total sulfur dioxide (0.67)
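A correlation matrix of this kind can be produced with `cor()`. The sketch below runs on only the first ten rows of three columns, copied from the `str()` output near the top, so its coefficients will not match the full-data values quoted above; the data frame name `excerpt` is made up for this example.

```r
# First ten rows of three columns, copied from the str() output above
excerpt <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations, rounded for reading
cor_matrix <- round(cor(excerpt), 2)
cor_matrix
```

Even on this small excerpt the signs agree with the list: both acidity measures move together, and each moves against pH. On the full data, `round(cor(wine), 2)` should give the complete matrix, assuming every column is numeric.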

# scatterplot for citric acid and fixed acidity
ggplot(data = wine, aes(x = citric.acid, y = fixed.acidity)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

# scatterplot for citric acid and volatile acidity
ggplot(data = wine, aes(x = citric.acid, y = volatile.acidity)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

# scatterplot for fixed acidity and density
ggplot(data = wine, aes(x = fixed.acidity, y = density)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

# scatterplot for fixed acidity and pH
ggplot(data = wine, aes(x = fixed.acidity, y = pH)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

# scatterplot for citric acid and pH
ggplot(data = wine, aes(x = citric.acid, y = pH)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

# scatterplot for total and free sulfur dioxide
ggplot(data = wine, aes(x = total.sulfur.dioxide, y = free.sulfur.dioxide)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

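The scatterplots show a strong positive correlation between fixed acidity and citric acid: as one increases, so does the other. Volatile acidity and citric acid are negatively correlated: as one increases, the other decreases. Density and fixed.acidity are strongly positively correlated.

pH is negatively correlated with both fixed acidity and citric acid, which matches what we know about acids: more acid means lower pH.

Total sulfur dioxide and free sulfur dioxide are positively correlated, because total sulfur dioxide includes free sulfur dioxide, so when one increases the other does too.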

# Plot the scatterplot for chlorides and sulphates
ggplot(data = wine, aes(x = chlorides, y = sulphates)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  geom_smooth(method='lm', color='red')

# Plot the scatterplot for chlorides and sulphates
# which excludes the top 5% values
ggplot(data = wine, aes(x = chlorides, y = sulphates)) +
  geom_jitter(alpha=1/3, color = 'blue') +
  xlim(0, quantile(wine$chlorides, 0.95)) +
  ylim(0, quantile(wine$sulphates, 0.95)) +
  geom_smooth(method='lm', color='red')
## Warning: Removed 131 rows containing non-finite values (stat_smooth).
## Warning: Removed 134 rows containing missing values (geom_point).

# Find the correlation coefficient of chlorides/sulphates with top 5% removed
with(subset(wine, chlorides < quantile(wine$chlorides, 0.95) & 
              sulphates < quantile(wine$sulphates, 0.95)),
     cor.test(chlorides, sulphates))
## 
##  Pearson's product-moment correlation
## 
## data:  chlorides and sulphates
## t = -2.0202, df = 1466, p-value = 0.04354
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.103573750 -0.001532528
## sample estimates:
##         cor 
## -0.05269068

chlorides and sulphates are not really correlated. Their correlation coefficient on the full data is ??, but after removing the top 5% of values it is -0.05.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

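Quality is correlated with alcohol (0.48), volatile acidity (-0.39), sulphates (0.25) and citric acid (0.23); all of these are positive except volatile acidity.

Higher-quality wines have higher alcohol content.

Higher-quality wines have lower volatile acidity.

Quality and sulphates appear to be positively correlated, but the quality-5 group contains many outliers.

Low-quality wines (3, 4) contain very little citric acid; medium-quality wines (5, 6) contain about 0.25 g/dm^3; high-quality wines (7, 8) contain more than 0.25 g/dm^3.

Higher-quality wines have lower density and pH.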

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

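Wines with high fixed acidity also have high citric acid, and higher citric acid goes with higher quality. Volatile acidity and fixed acidity are negatively correlated, and wines with high volatile acidity tend to be of lower quality.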

What was the strongest relationship you found?

Quality has its strongest correlation with alcohol (0.48). The boxplots show that quality rises with alcohol.

Multivariate Plots Section

# Plot the scatterplot of citric acid and volatile acidity, color by quality
ggplot(data=wine,aes(x=citric.acid, y=volatile.acidity, color=quality)) +
  geom_point(alpha=1, position='jitter') +
  scale_color_brewer(type='div')

# Plot the scatterplot of citric acid and volatile acidity, facet by quality
# Also add the smoothed conditional mean to the plots
ggplot(data=wine,aes(x=citric.acid, y=volatile.acidity, color=quality)) + 
  geom_point(alpha=0.5, position='jitter') +
  geom_smooth(method='lm') +
  facet_wrap(~quality) + 
  scale_color_brewer(type='div') +
  scale_x_continuous(breaks=c(0,0.25,0.5,0.75)) +
  theme(axis.text.x = element_text(size = 10), 
        axis.text.y = element_text(size = 10))

# Plot the scatterplot of citric acid and volatile acidity, color by quality
# Show the smoothed conditional means in the same plot
ggplot(aes(x=citric.acid, y=volatile.acidity, color = quality), 
       data = wine) + 
  geom_point(alpha=0.2, position = 'jitter') +
  geom_smooth(method='lm', se=FALSE, size=1)

The scatterplots above show how citric acid and volatile acidity relate within each quality level. They illustrate two points:
1. Higher-quality wines have lower volatile acidity.
2. Within each quality level, citric acid and volatile acidity are negatively correlated.

# Plot the boxplots of citric.acid/fixed.acidity by quality
qplot(x = quality, y = citric.acid/fixed.acidity, data = wine, 
      geom = 'boxplot')

# Plot the histogram of citric.acid/fixed.acidity, fill by quality
# geom_histogram bins the continuous ratio; binwidth = 0.005 is an assumed value
ggplot(data = wine, aes(x = citric.acid/fixed.acidity)) +
  geom_histogram(aes(fill = quality), binwidth = 0.005)

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Conditioning on quality makes the relationship between citric acid and volatile acidity clearer: within every quality level the two are negatively correlated. A linear model on citric acid and volatile acidity could be used to predict quality.
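A linear model of this kind can be sketched with `lm()`. The example below fits on only the first ten rows, copied from the `str()` output near the top, so the coefficients are illustrative and say nothing reliable about the full data. Treating quality as a numeric response is an assumption made for this sketch, and the wine passed to `predict()` is hypothetical.

```r
# First ten rows, copied from the str() output above (far too few for a real fit)
excerpt <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Ordinary least squares: quality as a linear function of the two acids
fit <- lm(quality ~ citric.acid + volatile.acidity, data = excerpt)
coef(fit)

# Predicted quality for a hypothetical wine
predict(fit, newdata = data.frame(citric.acid = 0.3, volatile.acidity = 0.4))
```

On the full data frame the same formula would be fitted with `data = wine`, and `summary(fit)` would then report how much of the variance in quality the two variables explain.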

Were there any interesting or surprising interactions between features?

The ratio of citric acid to fixed acidity is a useful indicator of quality: for high-quality wines the ratio is close to 0.05.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

# Plot the frequency polygon of citric acid
qplot(citric.acid, data = wine, color=I(color_fill), binwidth=0.01, 
      geom = 'freqpoly') +
  ggtitle('Frequency Polygon of Citric Acid') +
  xlab('Citric Acid (g / dm^3)') +
  ylab('Number of Samples') +
  theme(plot.title = element_text(size = 16))

Description One

Citric acid has a multimodal distribution, with three peaks near 0, 0.25 and 0.5. The sample contains a large number of zeros.

Plot Two

# Plot the scatterplot of citric acid and volatile acidity, color by quality
# Show the smoothed conditional means in the same plot
ggplot(data = wine, aes(x=citric.acid, y=volatile.acidity, 
                        color = quality)) + 
  geom_point(alpha=0.7, position = 'jitter') +
  geom_smooth(method='lm', se=FALSE, size=1) +
  coord_cartesian(xlim = c(0, 0.8), ylim=c(0,1.25)) +
  ggtitle('Citric Acid / Volatile Acidity by Quality') +
  xlab('Citric Acid (g / dm^3)') +
  ylab('Volatile Acidity (g / dm^3)') +
  scale_color_discrete(name="Quality") +
  theme(plot.title = element_text(size = 16))

Description Two

Higher-quality wines have more citric acid and less volatile acidity, and the two are negatively correlated. One possible explanation is that citric acid and volatile acidity convert into each other under certain conditions.

Plot Three

Description Three

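The plot shows that no wine with volatile acidity above 1 is rated excellent. When volatile acidity is 0 or 0.3, a wine has about a 40% chance of being excellent. When volatile acidity is between 1 and 1.2, a wine has about an 80% chance of being bad, and above 1.4 every wine in the sample is bad. Volatile acidity is therefore a good feature for detecting bad wines.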
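Percentages like these are conditional proportions: the share of each rating within a band of volatile acidity. The sketch below shows one way to compute them in base R. The rating bands (bad up to 4, average 5 to 6, excellent 7 and above) and the acidity bins are assumptions made for this example, since the text does not define them, and the data are only the first ten rows from the `str()` output near the top.

```r
# First ten rows, copied from the str() output above
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed rating bands: bad (<= 4), average (5-6), excellent (>= 7)
rating <- cut(quality, breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "excellent"))

# Assumed volatile acidity bins
acidity_bin <- cut(volatile.acidity, breaks = c(0, 0.6, 1.0))

# Share of each rating within each acidity bin (rows sum to 1)
rating_share <- prop.table(table(acidity_bin, rating), margin = 1)
rating_share
```

With `margin = 1` each row is normalised separately, so a cell reads as "the proportion of wines in this acidity bin that received this rating".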


Reflection